Multi-level models are a workhorse for understanding ecological processes because so many problems contain information at nested spatial scales, levels of organization, or categories. This problem will give you practice implementing the math that you wrote in the model building exercise on N2O emissions from agricultural soils. The deterministic models that we will use here are linear, but the approach applies equally well to non-linear forms. The data set that you will analyze is described in the companion document MultilevelModelBuildingExercise.pdf.
You will write JAGS code to implement models of increasing complexity and power.
There are two sets of problems here, one focusing on model writing, the other focusing on writing code to implement the MCMC in JAGS. I suggest you work this way: alternate between the model writing and the model coding. Study the math, then use the math as a template for your code. If we had more time, I would have asked you to write the models.
You need to load some data and libraries. It is always a good idea to look at the data. Examine the head of the data frame for emissions. Note that the columns group.index and fert.index contain indices for sites and fertilizer types. You will need to understand how these are used in the “index trick” covered in lecture. Plot \(N_2O\) emissions as a function of fertilizer input by group and fertilizer type.
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
library(reshape)
library(ggplot2)
#library(ESS575)
set.seed(5)
#y=N2OEmission
#w=SiteCarbon
#names(y)
##In case someone has difficulty loading libraries
y=read.csv("/Users/Tom/Documents/EcologicalModelingCourse/_A_Master_Lab_Exercises/Multi-level models NO2/OldFiles/NO_2 emission all data for exercise.csv")
# y <- y[,2:9]
w=read.csv("/Users/Tom/Documents/EcologicalModelingCourse/_A_Master_Lab_Exercises/Multi-level models NO2/OldFiles/Site_level_carbon_data for exercise.csv")
# w <- w[,2:5]
w$mean=w$mean/100 #transform carbon % to proportion
y.n.sites = length(unique(y$group))
head(y)
## X fertilizer group carbon n.input emission reps group.index fert.index
## 1 1 A 14 2.7 180 0.620 13 10 2
## 2 2 A 14 4.6 180 0.450 13 10 2
## 3 3 A 11 0.9 112 0.230 12 7 2
## 4 4 A 38 0.5 100 0.153 14 29 2
## 5 5 A 1 4.0 250 1.000 6 1 2
## 6 6 A 38 0.5 100 0.216 14 29 2
qplot(n.input, emission, data=y, color = group)
qplot(n.input, emission, data=y, color = fertilizer)
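Before writing any JAGS code, it may help to see the "index trick" in miniature. The sketch below uses made-up site labels (not the N2O data): converting labels to sequential integers lets a vector of intercepts be indexed inside a likelihood loop.

```r
# Toy illustration of the "index trick". Site labels that are not
# sequential integers are mapped to 1, 2, ..., J so that
# alpha[site.index[i]] selects the right intercept for observation i.
site <- c("B7", "A2", "B7", "C9", "A2")   # arbitrary site labels
site.index <- as.numeric(factor(site))    # 2 1 2 3 1 (levels sort alphabetically)
alpha <- c(0.1, 0.5, -0.3)                # one intercept per unique site
alpha[site.index]                         # 0.5 0.1 0.5 -0.3 0.1
```

In JAGS the same indexing appears as `alpha[y.group[i]]` in the likelihood, with `y.group` supplied in the data list.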
Later, you will also need a function to link the sequential indices used in JAGS to the groups (fertilizer and site) in the data. This kind of function is needed when the sites or factors in data frames do not correspond to the uninterrupted sequence of integers that JAGS requires for indices. If there are letters, names, or non-sequential numbers, you will need a function like this to align the JAGS output with the factors, sites, etc. I provide this function below. It would be a good idea to understand what it does.
(Actually, I think I will be able to show you a way to do this in a much more straightforward way. Stay tuned. I am almost done.)
group_from_index = function(group, group.index, output ){
#group is a vector of group names or numbers (for factors or sites, etc.)
#group.index is vector of sequential indices to group names or numbers. It is a sequence of integers 1 to length(group)
#output is a matrix or dataframe of JAGS output with number of rows = length(group). Each row contains statistics, etc from JAGS for each group.
a = unique(as.vector(group))
b = unique(group.index)
group.key=as.data.frame(t(rbind(a,b))) #columns containing group indices paired with group name or number
names(group.key)= c(names(as.data.frame(group)), names(as.data.frame(group.index)))
link.column.name = names(group.key)[2] #name of column for merging output data with groups
output2 = cbind(seq(1,nrow(output)),output) #give the output data sequential index the same as the group index
colnames(output2)[1]=link.column.name
group.data=as.data.frame(merge(group.key, output2, by = link.column.name )) #merge the JAGS output with the group names or numbers
return(group.data)
}
Your first task is to write a simple, “pooled” model where you gloss over differences in sites and fertilizer types and lump everything into a set of \(x\) & \(y\) pairs using the R template provided below. It is imperative that you study the data statement and match the variable names in your JAGS code to the left hand side of the = in the data list. Call the intercept alpha, the slope beta and use sigma to name the standard deviation in the likelihood.
#very important to study this:
data = list(
y.emission = log(y$emission),
y.n.input = log(y$n.input)-mean(log(y$n.input)) #center the data to speed convergence and aid in interpretation.
)
inits = list(
list(
alpha = 0,
beta = .5,
sigma = 50
),
list(
alpha = 1,
beta = 1.5,
sigma = 10
)
)
Write the code for the model. Compile the model and execute the MCMC to produce a coda object. Produce trace plots of the chains for model parameters. Produce a summary table for the parameters. Test for convergence using the Gelman diagnostic.
####Pooled model
{
sink("PooledJAGS.R")
cat("
model{
#priors
alpha ~ dnorm(0,.0001)
beta ~ dnorm(0,.0001)
sigma ~ dunif(0,100)
tau.reg <- 1/sigma^2
#likelihood
#note that data have been log-transformed on the R side.
for(i in 1:length(y.emission)){
log_mu[i] <- alpha + beta * y.n.input[i]
y.emission[i] ~ dnorm(log_mu[i], tau.reg)
}
}
",fill=TRUE)
sink()
}
n.adapt=3000
n.update=5000
n.iter= 5000
jm.pooled = jags.model(file="PooledJAGS.R", data=data, n.adapt = n.adapt, inits=inits, n.chains=length(inits))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 563
## Unobserved stochastic nodes: 3
## Total graph size: 1413
##
## Initializing model
update(jm.pooled, n.iter = n.update)
zc.pooled = coda.samples(jm.pooled, variable.names = c("alpha", "beta", "sigma"), n.iter=n.iter)
plot(zc.pooled)
summary(zc.pooled)
##
## Iterations = 8001:13000
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## alpha 0.04315 0.06731 0.0006731 0.0006732
## beta 0.91423 0.10659 0.0010659 0.0010659
## sigma 1.62436 0.04839 0.0004839 0.0006263
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## alpha -0.08639 -0.001671 0.04262 0.089 0.1753
## beta 0.70774 0.842780 0.91256 0.987 1.1215
## sigma 1.53351 1.591545 1.62267 1.656 1.7234
gelman.diag(zc.pooled)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## alpha 1 1
## beta 1 1
## sigma 1 1
##
## Multivariate psrf
##
## 1
Now you will implement the model that allows intercept to vary by group, where each intercept is drawn from a common distribution. Again, use the template provided below to allow you to concentrate on writing JAGS code for the model. Note that you must use the index trick covered in lecture to align the different groups with different intercepts. Here are the preliminaries to set up the model:
data = list(
y.emission = log(y$emission),
y.n.input = log(y$n.input) - mean(log(y$n.input)), #center the data to speed convergence and aid in interpretation. Can recover 0 intercept if needed.
y.group = y$group.index,
y.n.sites = length(unique(y$group))
)
inits = list(
list(
alpha = rep(0,y.n.sites),
beta = .5,
sigma = 50,
mu.alpha= 0,
sigma.alpha = 10
),
list(
alpha = rep(1,y.n.sites),
beta = 1.5,
sigma = 10,
mu.alpha= -2,
sigma.alpha = 20
)
)
Write the model code. Compile the model and summarize coda output. Test for convergence.
{ #note this opening { and the closing } are needed by R markdown but not by R
####Hierarchical model, site level intercept, no site covariate
sink("Hier_1")
cat("
model{
##hyperpriors
mu.alpha ~ dnorm(0,.00001)
sigma.alpha ~ dunif(0,200) #notated varsigma in model documentation
tau.alpha <- 1/sigma.alpha^2
sigma ~ dunif(0,100)
tau.reg <- 1/sigma^2
###priors
for(j in 1:y.n.sites){
alpha[j] ~ dnorm(mu.alpha,tau.alpha)
}
beta ~ dnorm(0,.0001)
####
#likelihood
#note that data have been log-transformed on the R side.
for(i in 1:length(y.emission)){
log_mu[i] <- alpha[y.group[i]] + beta * y.n.input[i]
y.emission[i] ~ dnorm(log_mu[i], tau.reg)
}
}
",fill=TRUE)
sink()
}
n.update=10000
n.iter=25000
jm.hier1 = jags.model("Hier_1", data=data, n.adapt = 3000, inits=inits, n.chains=length(inits))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 563
## Unobserved stochastic nodes: 111
## Total graph size: 2245
##
## Initializing model
update(jm.hier1, n.iter = n.update)
#You would want to include the alphas in the check for convergence, but I eliminated them here to make the output more compact.
zc.hier1 = coda.samples(jm.hier1, variable.names = c("sigma","beta", "mu.alpha", "sigma.alpha"), n.iter=n.iter)
plot(zc.hier1)
summary(zc.hier1)
##
## Iterations = 13001:38000
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 25000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta 0.8747 0.09374 0.0004192 0.0007068
## mu.alpha 0.1798 0.12993 0.0005811 0.0006956
## sigma 1.0273 0.03405 0.0001523 0.0002276
## sigma.alpha 1.2113 0.09911 0.0004432 0.0007433
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta 0.69011 0.8117 0.8754 0.9376 1.0579
## mu.alpha -0.07369 0.0922 0.1800 0.2666 0.4337
## sigma 0.96350 1.0036 1.0263 1.0497 1.0971
## sigma.alpha 1.03103 1.1421 1.2070 1.2751 1.4196
gelman.diag(zc.hier1)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## beta 1 1
## mu.alpha 1 1
## sigma 1 1
## sigma.alpha 1 1
##
## Multivariate psrf
##
## 1
Modify your model to include a covariate at the site level, soil carbon content, as developed in the model writing problem #3.
Set up data and initial conditions:
#######Hierarchical model, site level intercept predicted from carbon concentration covariate.
data = list(
y.emission = log(y$emission),
y.n.input = log(y$n.input)-mean(log(y$n.input)), #center the data to speed convergence and aid in interpretation
y.group= y$group.index,
y.fert = y$fert.index,
y.n.sites = length(unique(y$group)),
y.n.fert = length(unique(y$fertilizer)),
w = log(w$mean/(1-w$mean)) #logit of w$mean
)
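The `w` entry above is the logit of the carbon proportion. As a quick sanity check (toy proportions, not the site data), note that base R's `qlogis()` computes the same transform:

```r
# The logit, log(p/(1-p)), maps proportions in (0,1) onto the whole real
# line, which is why it is applied to the site carbon proportions before
# they enter a linear model.
p <- c(0.005, 0.027, 0.25)        # toy carbon proportions
logit.p <- log(p / (1 - p))
all.equal(logit.p, qlogis(p))     # TRUE: qlogis() is the same transform
```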
y.n.sites = length(unique(y$group))
y.n.fert = length(unique(y$fertilizer))
inits = list(
list(
alpha = rep(0,y.n.sites),
beta = .5,
sigma = 50,
sigma.alpha = 10,
eta = .2,
kappa = .5
),
list(
alpha = rep(-.2,y.n.sites),
beta = 1.5,
sigma = 10,
sigma.alpha = 20,
eta = .2,
kappa = 5
)
)
Write the model code and compile it. Produce JAGS and CODA objects. Summarize the coda object. Test for convergence.
{
sink("Hier_2")
cat("
model{
#priors for within site model######
sigma ~ dunif(0,200)
tau.reg <- 1/sigma^2
beta ~ dnorm(0,.00001)
#priors for intercept model#######
kappa ~ dnorm(0,.00001)
eta ~ dnorm(0, .000001)
sigma.alpha ~ dunif(0,200)
tau.alpha <- 1/sigma.alpha^2
#likelihood for data, note that data are on log scale in data statement on R side.
for(i in 1:length(y.emission)){
log_mu[i] <- alpha[y.group[i]] + beta * y.n.input[i]
y.emission[i] ~ dnorm(log_mu[i], tau.reg)
}
# carbon model for intercept
for(j in 1:y.n.sites){
#use normal because data are centered
mu.alpha[j] <- kappa + eta *w[j]
alpha[j] ~ dnorm(mu.alpha[j],tau.alpha)
}
} #end of model
",fill=TRUE)
sink()
}
n.update=25000
n.iter=25000
jm.hier2 = jags.model("Hier_2", data=data, n.adapt = 3000, inits=inits, n.chains=length(inits))
## Warning in jags.model("Hier_2", data = data, n.adapt = 3000, inits =
## inits, : Unused variable "y.fert" in data
## Warning in jags.model("Hier_2", data = data, n.adapt = 3000, inits =
## inits, : Unused variable "y.n.fert" in data
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 563
## Unobserved stochastic nodes: 112
## Total graph size: 2519
##
## Initializing model
jm.hier2
## JAGS model:
##
##
## model{
## #priors for within site model######
## sigma ~ dunif(0,200)
## tau.reg <- 1/sigma^2
## beta ~ dnorm(0,.00001)
## #priors for intercept model#######
## kappa ~ dnorm(0,.00001)
## eta ~ dnorm(0, .000001)
## sigma.alpha ~ dunif(0,200)
## tau.alpha <- 1/sigma.alpha^2
##
##
##
## #likelihood for data, note that data are on log scale in data statement on R side.
## for(i in 1:length(y.emission)){
## log_mu[i] <- alpha[y.group[i]] + beta * y.n.input[i]
## y.emission[i] ~ dnorm(log_mu[i], tau.reg)
## }
## # carbon model for intercept
## for(j in 1:y.n.sites){
## #use normal because data are centered
## mu.alpha[j] <- kappa + eta *w[j]
## alpha[j] ~ dnorm(mu.alpha[j],tau.alpha)
## }
##
## } #end of model
##
##
## Fully observed variables:
## w y.emission y.group y.n.input y.n.sites
update(jm.hier2, n.iter = n.update)
#You would want to include the alphas in the check for convergence, but I eliminated them here to make the output more compact.
zc.hier2 = coda.samples(jm.hier2, variable.names = c("sigma","eta", "kappa"), n.iter=n.iter)
#zj.hier2 = jags.samples(jm.hier2, variable.names = c(variable.names = c("alpha", "beta", "sigma","eta", "kappa")), n.iter=n.iter)
plot(zc.hier2)
gelman.diag(zc.hier2)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## eta 1 1
## kappa 1 1
## sigma 1 1
##
## Multivariate psrf
##
## 1
summary(zc.hier2)
##
## Iterations = 28001:53000
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 25000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## eta 0.3049 0.15288 0.0006837 0.0050440
## kappa 1.3645 0.60860 0.0027218 0.0201402
## sigma 1.0266 0.03376 0.0001510 0.0002232
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## eta 0.006005 0.2032 0.3022 0.4069 0.6113
## kappa 0.174522 0.9572 1.3536 1.7708 2.5854
## sigma 0.962767 1.0032 1.0259 1.0492 1.0948
Modify your model with the covariate at the site level, soil carbon content, to allow slopes to vary with fertilizer type as developed in the model writing problem #4.
Set up data and initial conditions:
#######Hierarchical model, site level intercept predicted from carbon concentration covariate and slope varying with fertilizer type.
data = list(
y.emission = log(y$emission),
y.n.input = log(y$n.input)-mean(log(y$n.input)), #center the data to speed convergence and aid in interpretation
y.group= y$group.index,
y.fert = y$fert.index,
y.n.sites = length(unique(y$group)),
y.n.fert = length(unique(y$fertilizer)),
w = log(w$mean/(1-w$mean)) #logit of w$mean
)
y.n.sites = length(unique(y$group))
y.n.fert = length(unique(y$fertilizer))
inits = list(
list(
alpha = rep(0,y.n.sites),
beta = rep(.5,y.n.fert),
sigma = 50,
sigma.alpha = 10,
eta = .2,
kappa = .5
),
list(
alpha = rep(-.2,y.n.sites),
beta = rep(1.5, y.n.fert),
sigma = 10,
sigma.alpha = 20,
eta = .2,
kappa = 5
)
)
Write the model code and compile it. Produce JAGS and CODA objects. Summarize the coda object. Test for convergence.
{
sink("Hier_3")
cat("
model{
#priors for within site model######
sigma ~ dunif(0,200)
tau.reg <- 1/sigma^2
#priors for intercept model#######
kappa ~ dnorm(0,.00001)
eta ~ dnorm(0, .000001)
sigma.alpha ~ dunif(0,200)
tau.alpha <- 1/sigma.alpha^2
#hyper priors for slope model
mu.beta ~ dnorm(0,.00001)
sigma.beta ~ dunif(0,200)
tau.beta <- 1/sigma.beta^2
#likelihood for data, note that data are on log scale in data statement on R side.
for(i in 1:length(y.emission)){
log_mu[i] <- alpha[y.group[i]] + beta[y.fert[i]] * y.n.input[i]
y.emission[i] ~ dnorm(log_mu[i], tau.reg)
}
# carbon model for intercept
for(j in 1:y.n.sites){
#use normal because data are centered
mu.alpha[j] <- kappa + eta *w[j]
alpha[j] ~ dnorm(mu.alpha[j],tau.alpha)
}
#Allow slope to vary by fertilizer type
for(k in 1:y.n.fert){
beta[k] ~ dnorm(mu.beta, tau.beta)
}
} #end of model
",fill=TRUE)
sink()
}
n.update=5000
n.iter=2500
jm.hier3 = jags.model("Hier_3", data=data, n.adapt = 3000, inits=inits, n.chains=length(inits))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 563
## Unobserved stochastic nodes: 123
## Total graph size: 3236
##
## Initializing model
update(jm.hier3, n.iter = n.update)
#You would want to include the alphas in the check for convergence, but I eliminated them here to make the output more compact.
zc.hier3 = coda.samples(jm.hier3, variable.names = c("sigma","eta", "kappa","beta", "mu.beta", "sigma.beta"), n.iter=n.iter)
zj.hier3 = jags.samples(jm.hier3, variable.names = c(variable.names = c("alpha", "beta", "sigma","eta", "kappa")), n.iter=n.iter)
plot(zc.hier3)
gelman.diag(zc.hier3)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## beta[1] 1.00 1.02
## beta[2] 1.00 1.00
## beta[3] 1.00 1.00
## beta[4] 1.01 1.03
## beta[5] 1.00 1.00
## beta[6] 1.00 1.00
## beta[7] 1.00 1.00
## beta[8] 1.00 1.00
## beta[9] 1.00 1.01
## beta[10] 1.00 1.00
## eta 1.03 1.11
## kappa 1.03 1.11
## mu.beta 1.00 1.01
## sigma 1.00 1.01
## sigma.beta 1.02 1.02
##
## Multivariate psrf
##
## 1.02
summary(zc.hier3)
##
## Iterations = 8001:10500
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 2500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta[1] 0.5326 0.33258 0.0047034 0.0091424
## beta[2] 0.6698 0.20750 0.0029345 0.0046988
## beta[3] 0.8922 0.18292 0.0025869 0.0041755
## beta[4] 0.7799 0.14585 0.0020626 0.0029850
## beta[5] 1.6732 0.42696 0.0060381 0.0163378
## beta[6] 1.0046 0.20532 0.0029036 0.0048909
## beta[7] 1.0482 0.45845 0.0064835 0.0127711
## beta[8] 1.0765 0.32836 0.0046438 0.0065308
## beta[9] 0.7064 0.36652 0.0051834 0.0075970
## beta[10] 1.2063 0.48727 0.0068910 0.0111301
## eta 0.3052 0.16043 0.0022688 0.0161799
## kappa 1.3498 0.63711 0.0090101 0.0688423
## mu.beta 0.9598 0.23726 0.0033554 0.0052297
## sigma 1.0144 0.03376 0.0004774 0.0007139
## sigma.beta 0.3975 0.46282 0.0065453 0.0305515
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta[1] -0.15188 0.3129 0.5425 0.7673 1.139
## beta[2] 0.24931 0.5332 0.6750 0.8136 1.065
## beta[3] 0.52959 0.7720 0.8917 1.0142 1.257
## beta[4] 0.48995 0.6827 0.7799 0.8787 1.059
## beta[5] 0.89256 1.3693 1.6651 1.9581 2.535
## beta[6] 0.59386 0.8689 1.0019 1.1404 1.411
## beta[7] 0.16123 0.7551 1.0285 1.3298 2.009
## beta[8] 0.43827 0.8636 1.0635 1.2817 1.752
## beta[9] -0.08893 0.4787 0.7267 0.9475 1.390
## beta[10] 0.34976 0.8780 1.1539 1.4904 2.301
## eta 0.01482 0.1917 0.3008 0.4085 0.646
## kappa 0.19436 0.8990 1.3242 1.7721 2.686
## mu.beta 0.50613 0.8194 0.9507 1.0928 1.452
## sigma 0.95119 0.9909 1.0133 1.0362 1.084
## sigma.beta 0.02389 0.1308 0.2603 0.4814 1.717
The answer will contain some plotting code that displays the results nicely. It would be a good idea to understand this code, but I didn’t want to burden you with producing it!
slopes=t(summary(zj.hier3$beta,quantile, c(.025,.5,.975))$stat) #the transpose, t(), is important to make the next function work
#slopes as a function of fertilizer type
group.data=group_from_index(group=y$fertilizer,group.index=y$fert.index,output=slopes)
#table with medians and credible intervals for slopes by fertilizer type
group.data
## group.index group 2.5% 50% 97.5%
## 1 1 N -0.131921131 0.6074782 1.082718
## 2 10 unknown 0.357691204 1.1189286 2.002401
## 3 2 A 0.271622307 0.6916153 1.027257
## 4 3 M 0.540905085 0.9047884 1.218483
## 5 4 AN 0.508505013 0.7912336 1.030657
## 6 5 UM 0.855325252 1.5830632 2.364850
## 7 6 U 0.606326636 1.0043058 1.358286
## 8 7 UA 0.224277869 0.9834985 1.765735
## 9 8 AM 0.488017143 1.0574224 1.642125
## 10 9 UAN -0.001743414 0.7661812 1.305885
#plot of medians and credible intervals for slopes by fertilizer type
names(group.data)[3:5]=c("lower", "median", "upper")
library(ggplot2)
ggplot( group.data, aes(x = group, y = median)) + geom_bar(position = position_dodge(), stat="identity", fill="red") + geom_errorbar(aes(ymin=lower, ymax=upper)) + ggtitle("Medians of slopes by fertilizer type with 95% credible intervals") + # plot title
labs(x="Fertilizer", y=expression(beta)) +
theme_bw() + # remove grey background (because Tufte said so)
theme(panel.grid.major = element_blank()) # remove x and y major grid lines (because Tufte said so)
We now want to allow both slopes and intercepts to vary by site as described in the math exercise. This is a pretty challenging coding problem, so I thought it would be more useful for you to study code that works than to try to write the code yourself. So take a careful look at the following and discuss it with your lab mates.
As usual, we set up data and initial values:
data = list(
y.emission = log(y$emission),
y.n.input = log(y$n.input)-mean(log(y$n.input)), #center the data to speed convergence and aid in interpretation
y.group= y$group.index,
y.fert = y$fert.index,
y.n.sites = length(unique(y$group)),
y.n.fert = length(unique(y$fertilizer))
)
y.n.sites = length(unique(y$group))
B = matrix(nrow=y.n.sites, ncol=2)
B[,1]=0
B[,2]=1.5
inits = list(
list(
B=B,
sigma = 50,
mu.alpha = 0,
mu.beta = 1.5,
sigma.alpha = 10,
sigma.beta = 10,
rho=-.5
),
list(
B=B*.5,
sigma = 20,
mu.alpha = -.2,
mu.beta = .8,
sigma.alpha = 50,
sigma.beta = 50,
rho=.5
)
)
Here is the code for the model. Study it relative to the math.
{
sink("Hier_4")
cat("
model{
#priors for within site model######
sigma ~ dunif(0,200)
tau.reg <- 1/sigma^2
#likelihood for data, note that data are on log scale.
for(i in 1:length(y.emission)){
log_mu[i] <- alpha[y.group[i]] + beta[y.group[i]] * y.n.input[i]
y.emission[i] ~ dnorm(log_mu[i], tau.reg)
}
# Model for group intercept and slope:
for(j in 1:y.n.sites){
alpha[j] <- B[j,1] #group level intercept
beta[j] <- B[j,2] #group level slope
B[j,1:2] ~ dmnorm(B.hat[j,1:2], Tau.B)
B.hat[j,1] <- mu.alpha #required by JAGS syntax
B.hat[j,2] <- mu.beta #required by JAGS syntax
}
mu.alpha ~ dnorm(0,.0001) #mean intercept
mu.beta ~ dnorm(0, .0001) #mean slope
#Inverse of covariance matrix required by JAGS
Tau.B[1:2,1:2] <- inverse(Sigma.B[1:2,1:2])
#Elements of covariance matrix
Sigma.B[1,1] <- sigma.alpha^2
sigma.alpha ~ dunif(0,200)
Sigma.B[2,2] <- sigma.beta^2
sigma.beta ~ dunif(0,200)
Sigma.B[1,2] <- rho*sigma.alpha*sigma.beta # covariance is the correlation coefficient times the product of the standard deviations
Sigma.B[2,1] <- Sigma.B[1,2]
rho ~ dunif(-1,1)
} #end of model
",fill=TRUE)
sink()
}
Compile the model and get some output. In the plot below, red bars give medians; black lines are 95% credible intervals.
n.update=50000
n.iter=10000
jm.hier4 = jags.model("Hier_4", data=data, n.adapt = 3000, inits=inits, n.chains=length(inits))
## Warning in jags.model("Hier_4", data = data, n.adapt = 3000, inits =
## inits, : Unused variable "y.fert" in data
## Warning in jags.model("Hier_4", data = data, n.adapt = 3000, inits =
## inits, : Unused variable "y.n.fert" in data
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 563
## Unobserved stochastic nodes: 113
## Total graph size: 2726
##
## Initializing model
update(jm.hier4, n.iter = n.update)
#You should run diagnostics on the zc.hier coda object for the intercepts and slopes, but I eliminated them here to make the output more compact
zc.hier4 = coda.samples(jm.hier4, variable.names = c('mu.alpha', "mu.beta", "rho"), n.iter=n.iter)
zj.hier4 = jags.samples(jm.hier4, variable.names = c("alpha", "beta", 'mu.alpha', "mu.beta", "rho"), n.iter=n.iter)
plot(zc.hier4)
gelman.diag(zc.hier4)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## mu.alpha 1 1.00
## mu.beta 1 1.01
## rho 1 1.01
##
## Multivariate psrf
##
## 1
summary(zc.hier4)
##
## Iterations = 53001:63000
## Thinning interval = 1
## Number of chains = 2
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu.alpha 0.1865 0.1358 0.0009605 0.001797
## mu.beta 0.8583 0.1279 0.0009041 0.003298
## rho -0.5584 0.1930 0.0013647 0.008270
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu.alpha -0.08247 0.09463 0.1867 0.2774 0.4534
## mu.beta 0.61017 0.77215 0.8570 0.9462 1.1050
## rho -0.89329 -0.69830 -0.5734 -0.4346 -0.1402
#Make a vector to link the sequential index (in the output) to the group number. This would work the same way if the groups were character variables like names. See the group_from_index() function at the top of the file.
slopes = t(summary(zj.hier4$beta, quantile, c(.025,.5,.975))$stat) #the transpose, t(), is important to make the next function work
group.data=as.data.frame(group_from_index(group=y$group,group.index=y$group.index,output=slopes))
names(group.data)[3:5]=c("lower", "median", "upper")
library(ggplot2)
ggplot( group.data, aes(x = group, y = median)) + geom_bar(position = position_dodge(), stat="identity", fill="red") + geom_errorbar(aes(ymin=lower, ymax=upper)) + ggtitle("Medians of site-level slopes with 95% credible intervals") + # plot title
labs(x="Site", y=expression(beta)) +
theme_bw() + # remove grey background (because Tufte said so)
theme(panel.grid.major = element_blank()) # remove x and y major grid lines (because Tufte said so)
plot(density(zj.hier4$mu.beta), xlab = expression(mu[beta]), main="Posterior distribution of mean slope", cex.lab=1.25)
All statistical inference is based on some type of statistical model. A truly fundamental requirement for reliable inference is that the statistical model is capable of giving rise to the data. If this requirement is not met, you have no basis for inference. Statistical theory will not back you up. You are flying blind, proceeding bravely, perhaps foolishly, on your own. These truths motivate the need for model checking. Models that fail checks should be discarded at the outset. (This is not model selection. More about that later.)
The Bayesian approach provides a method, simple to implement, that allows you to check whether your model is capable of producing the data. It is called a posterior predictive check. The details of the math are given in Hobbs and Hooten section 8.1 and were covered in lecture. Here is a brief description of how to code it. If your likelihood is
y[i] ~ dnorm(log_mu[i], tau)
you can simulate a new data set by embedding
y.sim[i] ~ dnorm(log_mu[i], tau)
in the same for loop.
Calculate a test statistic based on the real and the simulated data. The test statistic could be a mean, standard deviation, coefficient of variation, discrepancy, minimum, maximum – really any function that helps you compare the simulated and real data. I think it is wise to always include a test statistic that in some way reflects the variance, because it is harder for a model to get the variance of the data right than the mean.
We are interested in calculating a Bayesian p value: the probability that the test statistic computed from the simulated data is more extreme than the test statistic computed from the real data. There is evidence of lack of fit – the model cannot give rise to the data – if the Bayesian p value is large or small. We want values between, say, .10 and .90, ideally close to .5. To obtain the Bayesian p value we use the JAGS step(x) function, which returns 0 if x is less than 0 and 1 otherwise. So, presume our test statistic was the standard deviation. Consider the following pseudo-code:
for(i in 1:length(y)){
mu[i] <- prediction from model
y[i] ~ dnorm(mu[i], tau)
y.sim[i] ~ dnorm(mu[i], tau)
}
sd.data<-sd(y[])
sd.sim <-sd(y.sim[])
p.sd <- step(sd.sim - sd.data)
That is all there is to it. You then include p.sd in your jags or coda object.
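The same logic can be sketched outside JAGS in a few lines of plain R. This toy example uses simulated data and plug-in parameter estimates (not draws from a posterior, and not the N2O model), just to show what the p value measures:

```r
# Toy posterior predictive check for the standard deviation: simulate
# replicated data sets from the fitted model and record how often the
# simulated sd exceeds the observed sd. A model that can reproduce the
# data gives a p value near 0.5; values near 0 or 1 signal lack of fit.
set.seed(1)
y <- rnorm(100, mean = 0, sd = 1)              # stand-in "real" data
mu.hat <- mean(y); sigma.hat <- sd(y)          # plug-in parameter values
sd.sim <- replicate(5000, sd(rnorm(length(y), mu.hat, sigma.hat)))
p.sd <- mean(sd.sim > sd(y))                   # close to 0.5 here
p.sd
```

In the MCMC version, the comparison is made at every iteration with the current parameter draws, and step() accumulates the same proportion.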
Return to the pooled model you developed in the first problem of the multi-level modeling exercise. Do posterior predictive checks using the mean, standard deviation, minimum, and discrepancy as test statistics. The discrepancy statistic is \(\sum_{i=1}^{n}(y_i-\mu_i)^2\), where \(\mu_i\) is the \(i^{\text{th}}\) prediction of your model. Overlay the posterior distribution of the simulated data on the histogram of the real data (density on the y axis, not frequency). What do you conclude? Is there any indication of lack of fit? Enough to discard the model?
#preliminaries
library(rjags)
library(reshape)
library(ggplot2)
set.seed(5)
#setwd("/Users/Tom/Documents/Ecological Modeling Course/_A_Master_Lab_Exercises/Multi-level models NO2/")
y.n.sites = length(unique(y$group))
#data for all models except last one
data = list(
y.emission = log(y$emission),
y.n.input = log(y$n.input) - mean(log(y$n.input)) #center the data to speed convergence and aid in interpretation. Can recover 0 intercept if needed.
)
inits = list(
list(
alpha = 0,
beta = .5,
sigma = 50
),
list(
alpha = 1,
beta = 1.5,
sigma = 10
)
)
####Pooled model
{
sink("Pooled")
cat("
model{
#priors
alpha ~ dnorm(0,.0001)
beta ~ dnorm(0,.0001)
sigma ~ dunif(0,100)
tau.reg <- 1/sigma^2
#likelihood
for(i in 1:length(y.emission)){
mu[i] <- alpha + beta * y.n.input[i]
y.emission[i] ~ dnorm(mu[i], tau.reg)
#simulated data for posterior predictive checks
y.emission.sim[i] ~ dnorm(mu[i], tau.reg)
sq.error.data[i] <- (y.emission[i]-mu[i])^2
sq.error.sim[i] <- (y.emission.sim[i] - mu[i])^2
}
#Bayesian P values
sd.data <- sd(y.emission)
sd.sim <- sd(y.emission.sim)
p.sd <- step(sd.sim-sd.data)
mean.data <- mean(y.emission)
mean.sim <- mean(y.emission.sim)
p.mean <- step(mean.sim - mean.data)
discrep.data <- sum(sq.error.data)
discrep.sim <- sum(sq.error.sim)
p.discrep <- step(discrep.sim - discrep.data)
min.data <- min(y.emission)
min.sim <- min(y.emission.sim)
p.min <-step(min.sim-min.data)
}
",fill=TRUE)
sink()
}
n.update = 3000
n.iter = 10000
jm.pooled = jags.model("Pooled", data=data, n.adapt = 3000, inits=inits, n.chains=length(inits))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 563
## Unobserved stochastic nodes: 566
## Total graph size: 8708
##
## Initializing model
update(jm.pooled, n.iter = n.update)
#zc.pooled = coda.samples(jm.pooled, variable.names = c("alpha", "beta", "sigma"), n.iter=n.iter)
zj.pooled = jags.samples(jm.pooled, variable.names = c("alpha", "beta", "sigma", "p.sd", "p.mean", "p.discrep","p.min", "y.emission.sim"), n.iter=n.iter)
zj.pooled$p.sd
## mcarray:
## [1] 0.52375
##
## Marginalizing over: iteration(10000),chain(2)
zj.pooled$p.mean
## mcarray:
## [1] 0.4995
##
## Marginalizing over: iteration(10000),chain(2)
zj.pooled$p.discrep
## mcarray:
## [1] 0.5103
##
## Marginalizing over: iteration(10000),chain(2)
zj.pooled$p.min
## mcarray:
## [1] 1
##
## Marginalizing over: iteration(10000),chain(2)
hist(data$y.emission, breaks=20, freq=FALSE, main="Simulated and real data", xlab=expression(paste("log(", N[2]*O, " emission)")), cex.lab=1.2) #note that this is the log-transformed data
lines(density(zj.pooled$y.emission.sim), col="red")
legend(-10,.2,"simulated", col="red", lty="solid")